
[Bug] Add dsv4 state_type branch to mooncake disaggregation #24878

Merged
ch-wan merged 1 commit into sgl-project:main from ch-wan:cwan/mooncake-dsv4-state-transfer
May 10, 2026
Collaborator

@ch-wan ch-wan commented May 10, 2026

Motivation

PR #23882 introduced state_type="dsv4" for the new DeepSeek-V4 flat heterogeneous state pool (SWA + compress + indexer pools) and added a matching branch to NixlKVManager.maybe_send_extra. The mooncake sibling, MooncakeKVManager.maybe_send_extra, was never updated.

As a result, DSv4 disaggregated runs over the mooncake transfer backend silently fall through to maybe_send_extra's final else: return 0 branch, so the SWA / compress / indexer state pool is never transferred from prefill to decode. The decode-side state buffers keep whatever stale data they were initialized with, producing wrong outputs whenever the model attends to that state.
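The silent fall-through can be pictured with a minimal sketch. This is a hypothetical, simplified stand-in for the pre-patch dispatch, not the actual sglang code: the function name `maybe_send_extra_prepatch` and the callables passed to it are illustrative assumptions, and only the branch structure mirrors what is described above.

```python
# Hypothetical, simplified model of the pre-patch dispatch (not the real sglang code).
def maybe_send_extra_prepatch(state_type, send_generic, send_mamba):
    """Return a transfer status; 0 is also what 'nothing to send' looks like."""
    if state_type == "mamba":
        return send_mamba()
    elif state_type in ["swa", "nsa"]:
        return send_generic()
    else:
        # "dsv4" lands here: no transfer is issued and no error is raised.
        return 0


calls = []


def fake_generic():
    # Stand-in for the generic state-pool transfer helper.
    calls.append("generic")
    return 0


def fake_mamba():
    # Stand-in for the mamba-specific transfer helper.
    calls.append("mamba")
    return 0


status = maybe_send_extra_prepatch("dsv4", fake_generic, fake_mamba)
print(status, calls)  # 0 []
```

Because the fall-through status is indistinguishable from a successful transfer in this model, the caller has no signal that the state pool was skipped.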

Repro on GB300 disaggregated DSv4-Pro, /v1/completions with raw prompt_token_ids ending on the literal <think> token (id 128821):

  • Monolithic sglang: correct (gsm8k Janet question -> #### 18).
  • Disagg + NIXL: correct.
  • Disagg + mooncake (unpatched): wrong (model regurgitates an earlier few-shot answer, e.g. Weapon: ... #### 84).

The corruption is most visible when the prompt's last attended position lives in the SWA / indexer state pool. Other endings often look correct because the K/V cache itself does transfer via send_kvcache; the <think> ending is the worst-case noise location for that token's attention pattern.

Modifications

Add the missing dsv4 branch to MooncakeKVManager.maybe_send_extra that delegates to the existing _send_kvcache_generic -- the same helper the swa / nsa branches use. This routes DSv4's flat state pool through get_mla_kv_ptrs_with_pp, which iterates per-pool entry and uses prefill's state_item_lens for offset arithmetic.

Why delegation rather than a fresh per-page flat path:

  • The per-page flat path used by NIXL hard-asserts src_state_item_lens[i] == dst_state_item_lens[i] for every entry. With MTP enabled, decode's indexer-pool entry is 2x prefill's (decode carries the EAGLE-3 draft layer), so the assertion fires on the very first transfer and the engine never makes progress.
  • Delegating to _send_kvcache_generic matches what the staging DSv4 branch did before this code path was split into a dedicated dsv4 state_type. Prefill writes its half-size at the natural offset on decode; the decode-only MTP half is left untouched, which is correct -- decode populates its own draft state.
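The length mismatch in the two bullets above can be illustrated with a toy model. This is only a sketch under stated assumptions: plain byte buffers stand in for one state-pool entry, the helper names `strict_flat_copy` and `prefix_copy` are hypothetical, and the prefill-owned region is assumed to start at offset 0 of the decode entry.

```python
def strict_flat_copy(src: bytes, dst: bytearray) -> None:
    # Toy analogue of a path that requires identical entry lengths on both sides.
    assert len(src) == len(dst), "state_item_lens mismatch"
    dst[:] = src


def prefix_copy(src: bytes, dst: bytearray) -> None:
    # Toy analogue of sizing the copy by the prefill (source) entry length only.
    # Assumption: the prefill-owned region starts at offset 0 of the decode entry.
    if len(src) > len(dst):
        raise ValueError("source entry larger than destination entry")
    dst[: len(src)] = src


prefill_entry = bytes([1, 2, 3, 4])
decode_entry = bytearray([9] * 8)  # 2x prefill; the second half is decode-only

try:
    strict_flat_copy(prefill_entry, decode_entry)
    strict_failed = False
except AssertionError:
    strict_failed = True

prefix_copy(prefill_entry, decode_entry)
print(strict_failed, list(decode_entry))  # True [1, 2, 3, 4, 9, 9, 9, 9]
```

In this model the strict path stops on the first entry, while the source-sized copy fills the prefill half and leaves the decode-only half as it was.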

The diff is purely additive and bails for asymmetric TP between prefill and decode (matching the constraint of the surrounding swa / nsa path). Existing mamba / swa / nsa / none paths are unchanged.

Validation

GB300 1P+1D DSv4-Pro disagg, mooncake backend, with the patched conn.py hot-mounted into lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034.

| Variant | Pre-patch | Post-patch |
| --- | --- | --- |
| Non-MTP | <think>-ending repro returns wrong gsm8k answer (Weapon: ... #### 84) | <think>-ending repro returns correct #### 18. sa-bench (10 prompts, conc=1, ISL=8192/OSL=1024) completed cleanly: TTFT ~950 ms, TPOT ~11.5 ms, output throughput ~80 tok/s. |
| MTP (EAGLE-3, num-steps=3, draft-tokens=4) | NIXL hits state_item_lens mismatch assertion on the first transfer; mooncake unpatched silently corrupts | Engine accepted prefill chunks at ~1000-1400 tok/s, decode batches ran with accept rate 0.66-0.70 and ~190 tok/s gen throughput. Zero state_item_lens asserts, zero transfer-worker errors. |

Sample post-patch response for the original <think>-ending repro on mooncake:

We are given: "Janet's ducks lay 16 eggs per day. She eats three for breakfast every
morning and bakes muffins for her friends every day with four. She sells the remainder
at the farmers' market daily for $2 per fresh duck egg..."

Eggs laid: 16 per day - 7 = 9 eggs.
She sells these at $2 each, so daily earnings: 9 * $2 = $18.
Answer: 18.</think>

Checklist

cc @hnyls2002 (PR #23882 author).


@ch-wan ch-wan force-pushed the cwan/mooncake-dsv4-state-transfer branch from 5ebc8be to 47da450 Compare May 10, 2026 07:27
@ch-wan ch-wan marked this pull request as ready for review May 10, 2026 07:41
Collaborator Author

ch-wan commented May 10, 2026

/tag-and-rerun-ci

Collaborator

@ShangmingCai ShangmingCai left a comment


LGTM

Collaborator


nit: could elif state_type == "dsv4" branch join here as well?

Collaborator Author


Done — folded into the existing branch (force-pushed). The new diff is +7/-1: just adds "dsv4" to the existing state_type in ["swa", "nsa"] list, with a one-line comment about how the compressed-MLA PP/MTP layout is already handled by get_mla_kv_ptrs_with_pp. Thanks!
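A minimal sketch of what the folded branch amounts to, assuming a simplified dispatch. The function `maybe_send_extra_sketch`, its parameters, and the choice of `NotImplementedError` for the asymmetric-TP bail are all hypothetical; the actual code may signal that case differently.

```python
# "dsv4" joins the list that already routed "swa" and "nsa" to the generic helper.
GENERIC_STATE_TYPES = ["swa", "nsa", "dsv4"]


def maybe_send_extra_sketch(state_type, prefill_tp, decode_tp, send_generic):
    """Hypothetical dispatch: generic-path state types share one branch."""
    if state_type in GENERIC_STATE_TYPES:
        if prefill_tp != decode_tp:
            # Assumption: the real code may bail in another way (log, status code, ...).
            raise NotImplementedError(
                f"asymmetric TP is not supported for state_type={state_type!r}"
            )
        return send_generic(state_type)
    # Other state types are outside the scope of this sketch.
    return 0


sent = []


def fake_send_generic(state_type):
    # Stand-in for the generic transfer helper; records what was sent.
    sent.append(state_type)
    return 0


maybe_send_extra_sketch("dsv4", 4, 4, fake_send_generic)
print(sent)  # ['dsv4']
```

Sharing one branch means any later change to the generic path applies to all three state types at once, which is the practical upside of the reviewer's suggestion.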

PR sgl-project#23882 introduced ``state_type="dsv4"`` for the new DeepSeek-V4 flat
heterogeneous state pool (SWA + compress + indexer pools) and added a
matching branch to ``NixlKVManager.maybe_send_extra``, but the mooncake
sibling was never updated. As a result, DSv4 disaggregated runs with the
mooncake transfer backend silently fall through ``maybe_send_extra``'s
final ``else: return 0`` branch -- the SWA / compress / indexer state is
**never** transferred from prefill to decode, and the decode-side state
pool keeps whatever it was initialized with, producing wrong outputs
whenever the model attends to that state.

Repro shape on GB300 disaggregated DSv4-Pro:
- Same byte-identical ``prompt_token_ids`` ending on the literal
  ``<think>`` token (id ``128821``).
- Monolithic sglang: correct (gsm8k Janet question -> ``#### 18``).
- Disagg + **NIXL**: correct.
- Disagg + **mooncake**: wrong (model regurgitates an earlier few-shot
  answer, e.g. ``Weapon: ... #### 84``).

The corruption is most visible when the prompt's last attended position
lives in the SWA / indexer state pool, which the missing branch silently
leaves untransferred. Cases where attention happens to land entirely in
the K/V pool transferred via ``send_kvcache`` don't surface the bug.

## Modifications

Add the missing ``dsv4`` branch to ``MooncakeKVManager.maybe_send_extra``
that delegates to the existing ``_send_kvcache_generic`` -- the same
helper the ``swa`` / ``nsa`` branches use. This routes DSv4's flat state
pool through ``get_mla_kv_ptrs_with_pp``, which already understands the
compressed-MLA PP/MTP layout (so the MTP/nextn tail entry on the decode
side is naturally accommodated).

The diff is purely additive and bails for asymmetric TP between prefill
and decode (matching the constraint of the surrounding ``swa`` / ``nsa``
path). Existing ``mamba`` / ``swa`` / ``nsa`` / ``none`` paths are
unchanged.

## Validation

GB300 1P+1D DSv4-Pro disagg, 8k/1k sa-bench (10 prompts, conc=1) on
mooncake with the patched ``conn.py`` hot-mounted into the latest
nightly container. Engine bring-up clean, no transfer-worker errors,
median TTFT ~950ms / TPOT ~11.5ms / output throughput ~80 tok/s. The
pre-patch ``<think>``-ending repro that produced ``Weapon: ... #### 84``
on mooncake now matches the NIXL output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ch-wan ch-wan force-pushed the cwan/mooncake-dsv4-state-transfer branch from 47da450 to fd7e9ca Compare May 10, 2026 07:51
Collaborator Author

ch-wan commented May 10, 2026

End-to-end MTP validation on the consolidated commit (fd7e9ca):

GB300 1P+1D DSv4-Pro disagg, mooncake backend, with EAGLE-3 MTP enabled (speculative-num-steps=3, speculative-num-draft-tokens=4), patched conn.py hot-mounted into lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. sa-bench, conc=1, ISL=8192/OSL=1024, 10 prompts:

============ Serving Benchmark Result ============
Successful requests:                     10
Output token throughput (tok/s):         136.02
Total Token throughput (tok/s):          1217.77
Mean TTFT (ms):                          977.80
Mean TPOT (ms):                          6.32
Mean E2EL (ms):                          6837.80

Compared to the non-MTP run on the same setup (80 tok/s output, 11.5 ms TPOT), MTP is delivering its expected ~1.7× speedup, which is direct evidence the indexer/draft state pool is being transferred correctly — the failure mode this PR is fixing would either fire a state_item_lens assert or silently corrupt outputs. Neither occurred.

So:

  • non-MTP correctness — verified earlier by re-running the original <think>-ending repro and getting the correct gsm8k Janet Answer: 18 (vs the unpatched mooncake Weapon: ... #### 84).
  • MTP correctness + speedup — verified by this run.

Both backends (NIXL pre-existing + this mooncake addition) now have working dsv4 state-pool transfer.
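For reference, the ~1.7× figure quoted in the comparison above follows directly from the reported numbers. The snippet below only redoes that arithmetic; the non-MTP values are the approximate ones stated in this thread (~80 tok/s, ~11.5 ms), so the ratios are approximate too.

```python
# Reported figures from this thread (non-MTP values are approximate).
non_mtp_tok_s, mtp_tok_s = 80.0, 136.02
non_mtp_tpot_ms, mtp_tpot_ms = 11.5, 6.32

throughput_speedup = mtp_tok_s / non_mtp_tok_s   # higher is better
tpot_speedup = non_mtp_tpot_ms / mtp_tpot_ms     # lower TPOT is better

print(f"{throughput_speedup:.2f}x {tpot_speedup:.2f}x")  # 1.70x 1.82x
```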

@ch-wan ch-wan merged commit c7f674e into sgl-project:main May 10, 2026
47 of 105 checks passed
ch-wan added a commit to SemiAnalysisAI/InferenceX that referenced this pull request May 10, 2026
Picks up sgl-project/sglang#24878 (merged as c7f674e4),
which adds the missing dsv4 state_type branch to
MooncakeKVManager.maybe_send_extra. Combined with the prior
revert of #1297's nixl switch (commit daa6785), the mooncake
backend now correctly transfers DSv4's flat heterogeneous
state pool for both non-MTP and MTP runs.

Validated on GB300 1P+1D: comp_with_think.json (the prompt
ending on the literal `<think>` token that previously surfaced
the corruption) now returns the correct gsm8k Janet answer
(`#### 18`) on mooncake disagg, matching mono and the NIXL
control. MTP sa-bench delivers ~136 tok/s output throughput
(~1.7x non-MTP), confirming draft acceptance is working.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants